library(tidyverse)
library(stevedata)
election_turnout
## # A tibble: 51 x 14
## year state region division turnoutho perhsed percoled gdppercap ss trumpw
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 2016 Alab… South East So… 59 84.3 23.5 42663 0 1
## 2 2016 Alas… West Pacific 61.3 92.1 28 81801 0 1
## 3 2016 Ariz… West Mountain 55 86 27.5 43269 0 1
## 4 2016 Arka… South West So… 52.8 84.8 21.1 41129 0 1
## 5 2016 Cali… West Pacific 56.7 81.8 31.4 61924 0 0
## 6 2016 Colo… West Mountain 70.1 90.7 38.1 58009 1 0
## 7 2016 Conn… North… New Eng… 65.2 89.9 37.6 72331 0 0
## 8 2016 Dela… South South A… 64.4 88.4 30 69930 0 0
## 9 2016 Dist… South South A… 60.9 89.3 54.6 181185 0 0
## 10 2016 Flor… South South A… 64.6 86.9 27.3 42595 1 1
## # … with 41 more rows, and 4 more variables: trumpshare <dbl>, sunempr <dbl>,
## # sunempr12md <dbl>, gdp <dbl>
library(cluster)
pam_dat <- election_turnout %>% dplyr::select(turnoutho, trumpshare,
percoled) %>% scale
sil_width <- vector() #empty vector to hold mean sil width
for (i in 2:10) {
pam_fit <- pam(pam_dat, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
# k=2 has the highest sil width
pam1 <- pam_dat %>% pam(k = 2)
library(plotly)
final <- election_turnout %>% select(-ss, -trumpw, -year, -state,
-division) %>% mutate(cluster = as.factor(pam1$clustering))
final %>% plot_ly(x = ~sunempr, y = ~gdppercap, z = ~percoled,
color = ~cluster, type = "scatter3d", mode = "markers", symbol = ~region,
symbols = c("circle", "x", "o"))
library(GGally)
ggpairs(final, columns = 2:9, aes(color = cluster))
pam1$silinfo$avg.width
## [1] 0.3775957
plot(pam1, which = 2)
election_turnout %>% slice(pam1$id.med)
## # A tibble: 2 x 14
## year state region division turnoutho perhsed percoled gdppercap ss trumpw
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 2016 Sout… South South A… 56.7 85.6 25.8 40212 0 1
## 2 2016 Wash… West Pacific 64.8 90.4 32.9 62213 0 0
## # … with 4 more variables: trumpshare <dbl>, sunempr <dbl>, sunempr12md <dbl>,
## # gdp <dbl>
I first chose the three variables turnoutho, trumpshare, and percoled and then conducted a PAM analysis with k = 2, which had the highest average silhouette width (0.377). The two clusters were similar in the state-level unemployment rate measurements, but differed most in the percentage of the state that completed high school (perhsed) and voter turnout for the highest office as a percent of the voting-eligible population (turnoutho). The two clusters' medoids are South Carolina and Washington, and the most prominent distinction between them is whether Trump won the state in the 2016 election. The South Carolina cluster had lower percentages of high school and college completion (perhsed/percoled), lower GDP per capita (gdppercap), lower turnout rates (turnoutho), and a higher share of the vote for Trump (trumpshare). The interpretation of the average silhouette width of 0.377 is that the cluster structure is weak and could be artificial.
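This interpretation can be backed with numbers by computing per-cluster means of the original (unscaled) variables; a minimal sketch, re-fitting the same k = 2 PAM solution from above so it runs on its own:

```r
library(tidyverse)
library(cluster)
library(stevedata)

# Re-fit the k = 2 PAM solution on the same three scaled variables
pam_dat <- election_turnout %>%
    select(turnoutho, trumpshare, percoled) %>%
    scale()
pam1 <- pam(pam_dat, k = 2)

# Per-cluster means on the original scale: one cluster should show
# higher education/turnout, the other a higher Trump vote share
election_turnout %>%
    mutate(cluster = factor(pam1$clustering)) %>%
    group_by(cluster) %>%
    summarize(across(c(turnoutho, perhsed, percoled, gdppercap, trumpshare),
        mean))
```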
pca_dat <- election_turnout %>% dplyr::select(-ss, -trumpw, -year,
-state, -division, -region, -sunempr12md) %>% scale
pcac <- princomp(pca_dat, cor = T)
summary(pcac, loading = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6741666 1.4003838 0.9412238 0.8398521 0.5872072
## Proportion of Variance 0.4004048 0.2801535 0.1265575 0.1007645 0.0492589
## Cumulative Proportion 0.4004048 0.6805584 0.8071158 0.9078803 0.9571392
## Comp.6 Comp.7
## Standard deviation 0.46411810 0.29089462
## Proportion of Variance 0.03077223 0.01208853
## Cumulative Proportion 0.98791147 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## turnoutho 0.348 0.251 0.334 0.769 0.283 0.174
## perhsed 0.359 0.501 -0.272 -0.733
## percoled 0.565 -0.132 -0.116 0.216 -0.773
## gdppercap 0.438 -0.270 -0.460 -0.548 0.248 0.399
## trumpshare -0.462 0.321 0.177 -0.640 0.270 -0.402
## sunempr -0.161 -0.500 -0.405 0.600 -0.386 -0.217
## gdp -0.492 0.707 -0.453 -0.228
eigval <- pcac$sdev^2 #square SDs to convert to eigenvalues
varprop <- round(eigval/sum(eigval), 2) #proportion of var explained by each PC
ggplot() + geom_bar(aes(y = varprop, x = 1:7), stat = "identity") +
xlab("") + geom_path(aes(y = varprop, x = 1:7)) + geom_text(aes(x = 1:7,
y = varprop, label = round(varprop, 2)), vjust = 1, col = "white",
size = 5) + scale_y_continuous(breaks = seq(0, 0.6, 0.2),
labels = scales::percent) + scale_x_continuous(breaks = 1:7)
library(factoextra)
fviz_pca_biplot(pcac)
To score high on PC1 means that the state has a low trumpshare and a low state-level unemployment rate entering November 2016 (sunempr), along with high voter turnout, high percentages of high school and college completion, and high GDP per capita; a low PC1 score is the opposite. This PC contrasts high educational attainment with Trump's vote share. For PC2, a high score means a high percentage of high school completion, a high share of the vote for Trump, and high voter turnout, while high PC2 scores pair with low unemployment rates, low GDP-related variables, and a low percentage of college graduates. This PC contrasts unemployment with high school completion. Finally, only four variables load appreciably on PC3. To score high on PC3 means having a high total state GDP and voter turnout but low GDP per capita and unemployment rates; scoring low is the opposite: high GDP per capita and unemployment, but low total GDP and voter turnout. This PC contrasts total GDP with GDP per capita. Together, these three principal components explain 80.71% of the total variance in the dataset.
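The PC1 interpretation can be made concrete by attaching the component scores back to the states and looking at both extremes; a sketch re-creating the `pcac` fit above for self-containment (note that the sign of a principal component is arbitrary, so the high and low ends may flip on other platforms):

```r
library(tidyverse)
library(stevedata)

# Re-fit the same PCA on the seven numeric variables
pca_dat <- election_turnout %>%
    select(-ss, -trumpw, -year, -state, -division, -region, -sunempr12md) %>%
    scale()
pcac <- princomp(pca_dat, cor = TRUE)

# Top and bottom three states on PC1: one extreme should pair high
# percoled/turnoutho with low trumpshare, the other the reverse
election_turnout %>%
    mutate(PC1 = pcac$scores[, 1]) %>%
    arrange(desc(PC1)) %>%
    select(state, PC1, percoled, turnoutho, trumpshare) %>%
    slice(c(1:3, 49:51))
```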
numerics <- election_turnout %>% dplyr::select(-ss, -year, -state,
-division, -region)
# trumpw is coded 0/1, so compare to 1 rather than the string "True"
logistic_fit <- glm(trumpw == 1 ~ ., data = numerics, family = "binomial")
prob_reg <- predict(logistic_fit, type = "response")
class_diag(prob_reg, truth = election_turnout$trumpw, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.4118 0 1 NaN NaN 0.5 0.4794
lin <- lm(trumpw == 1 ~ ., data = numerics)
score <- predict(lin)
class_diag(score, election_turnout$trumpw, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.4118 0 1 NaN NaN 0.5 0.5
k = 10
data <- sample_frac(numerics) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$trumpw
# fit the logistic regression model to the training set
fit <- glm(trumpw == 1 ~ ., data = train, family = "binomial")
# get predicted probabilities from the trained model on the test set
probs <- predict(fit, newdata = test, type = "response")
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
glimpse(election_turnout)
## Rows: 51
## Columns: 14
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201…
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ region <chr> "South", "West", "West", "South", "West", "West", "Northe…
## $ division <chr> "East South Central", "Pacific", "Mountain", "West South …
## $ turnoutho <dbl> 59.0, 61.3, 55.0, 52.8, 56.7, 70.1, 65.2, 64.4, 60.9, 64.…
## $ perhsed <dbl> 84.3, 92.1, 86.0, 84.8, 81.8, 90.7, 89.9, 88.4, 89.3, 86.…
## $ percoled <dbl> 23.5, 28.0, 27.5, 21.1, 31.4, 38.1, 37.6, 30.0, 54.6, 27.…
## $ gdppercap <int> 42663, 81801, 43269, 41129, 61924, 58009, 72331, 69930, 1…
## $ ss <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ trumpw <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, …
## $ trumpshare <dbl> 0.62083092, 0.51281512, 0.48671616, 0.60574102, 0.3161710…
## $ sunempr <dbl> 5.8, 6.9, 5.2, 3.8, 5.4, 2.9, 4.9, 4.5, 6.0, 4.7, 5.3, 2.…
## $ sunempr12md <dbl> -0.2, 0.3, -0.6, -0.6, -0.3, -0.6, -0.7, -0.2, -0.5, -0.4…
## $ gdp <dbl> 203829.8, 49363.4, 311091.0, 120374.8, 2657797.6, 329368.…
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.41 0 1 NaN NaN 0.5 0.51389
The logistic regression model has an in-sample AUC of .479, which is worse than a 50-50 chance of guessing, and its accuracy is also very low. The 10-fold CV AUC of about .51 is only marginally higher and still essentially at chance. I do not see the typical signs of overfitting; if anything, it is the opposite, since out-of-sample performance is no worse than in-sample performance.
library(caret)
knn_fit <- knn3(factor(trumpw == 1, levels = c("TRUE", "FALSE")) ~
., data = numerics)
prob_knn <- predict(knn_fit, numerics)
class_diag(prob_knn[, 1], election_turnout$trumpw, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.6667 0.8 0.4762 0.6857 0.7385 0.6381 0.7278
table(truth = factor(election_turnout$trumpw == 1, levels = c("TRUE",
"FALSE")), prediction = factor(prob_knn[, 1] > 0.5, levels = c("TRUE",
"FALSE"))) %>% addmargins
## prediction
## truth TRUE FALSE Sum
## TRUE 24 6 30
## FALSE 11 10 21
## Sum 35 16 51
k = 10 #choose number of folds
data <- sample_frac(numerics) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
for (i in 1:k) {
## Create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$trumpw ## Truth labels for fold i
## Train model on training set (all but fold i); knn3 needs a factor outcome
train <- train %>% mutate(trumpw = factor(trumpw))
fit <- knn3(trumpw ~ ., data = train)
## Test model on test set (fold i): column 2 is the probability of class '1'
probs <- predict(fit, newdata = test)[, 2]
## Get diagnostics for fold i
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.45667 0.68334 NaN NaN NaN NaN 0.49167
The kNN model does much better than the logistic regression in-sample, with a fair AUC of .728. After cross-validation, however, the AUC drops to about .49, which makes it a bad model. The drastic drop in AUC is a sign of overfitting: this nonparametric model has worse cross-validation performance than the logistic regression.
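As a cross-check on this train/CV gap (not part of the original class_diag workflow), caret's train() can run the same 10-fold CV internally; a sketch assuming the numerics data frame built earlier (re-created here so it runs on its own), with the loss/win labels being my own choice:

```r
library(caret)
library(tidyverse)
library(stevedata)

numerics <- election_turnout %>%
    select(-ss, -year, -state, -division, -region)

# caret::train needs a factor outcome; recode 0/1 as loss/win
dat <- numerics %>%
    mutate(trumpw = factor(trumpw, labels = c("loss", "win")))

set.seed(1)
# caret creates the folds and averages the fold metrics internally
knn_cv <- train(trumpw ~ ., data = dat, method = "knn",
    trControl = trainControl(method = "cv", number = 10))
knn_cv$results #out-of-fold accuracy for each candidate k
```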
fit <- lm(trumpshare ~ percoled + gdppercap, data = election_turnout)
yhat <- predict(fit)
mean((election_turnout$trumpshare - yhat)^2)
## [1] 0.006015025
k = 5 #choose number of folds
data <- election_turnout[sample(nrow(election_turnout)), ] #randomly order rows
folds <- cut(seq_len(nrow(election_turnout)), breaks = k, labels = F) #create folds
diags <- NULL
for (i in 1:k) {
train <- data[folds != i, ]
test <- data[folds == i, ]
## Fit linear regression model to training set
fit <- lm(trumpshare ~ percoled + gdppercap, data = train)
## Get predictions/y-hats on test set (fold i)
yhat <- predict(fit, newdata = test)
## Accumulate prediction error (MSE) for fold i (c(), not overwrite)
diags <- c(diags, mean((test$trumpshare - yhat)^2))
}
mean(diags)
## [1] 0.01088507
The cross-validated MSE (.011) is slightly higher than the in-sample MSE (.006), which is the expected direction, and the gap is small, so there are no strong signs of overfitting. Overall, both models perform exceptionally well, with MSEs near zero.
library(reticulate)
word1 <- "you"
word2 <- "all"
equal <- "="
word3 = "y'all"
print(r.word1 + " " + r.word2)
## you all
cat(c(word1, word2, equal, py$word3))
## you all = y'all
I first declared some words in an R chunk and a Python chunk. In the Python code I printed the words from the R chunk, referring to them with r.varname. Then, in the last R chunk, I combined the words from the first R chunk with the Python variable (accessed via py$varname) to make a phrase.